Clustering Player Groups from the NBA Roster

1. Executive Summary

Basketball is a sport in which numerous statistics can be derived from players. From points and assists, to blocks and steals, each basketball game offers rich data that people can make use of to generate insights. The objective of this study was to cluster players based on their statistics and attempt to identify who the best players are in the NBA, and which other players are most similar to them.

Per-game player data for the 2018-2019 NBA season was collected from the Basketball Reference website. The data was then cleaned and preprocessed: delimiter rows were removed, duplicate entries caused by player trades were consolidated, and numeric columns that had been read as objects were cast to numeric types (int or float).

Exploratory data analysis was performed for the purpose of dimensionality reduction. Although manual feature selection based on domain knowledge was used to reduce dimensionality, correlations were also taken into account in removing variables. Furthermore, principal component analysis (PCA) was done in order to identify the features which contribute more to the variance. However, PCA was not used for any purpose other than for deriving insights for the manual feature selection.

The data was scaled using the MinMaxScaler in order to mitigate the effect of variables with large magnitudes. KMeans clustering was performed on the players for each of the five most recent seasons of the NBA, and it was discovered that two clusters stood out each year in terms of efficiency: the star Point Guard cluster and the star Center cluster. This does not mean that players from other positions could not excel - rather, they were clustered into one of these two clusters.

We concluded that in order to find the best of the best in the NBA, coaches should look out for the players in the two star clusters each year. These clusters are provided in the Conclusion section for reference.

2. Introduction

2.1. Context

Basketball is one of the most popular sports in the United States. Although it is a sport that can be summarized as simply as ten people trying to put a ball through a hoop, this sport has grown extensively, from its humble beginnings in a Canadian gym to being one of the most watched sports in terms of attendance.

The U.S. National Basketball Association, or NBA, leads in total attendance among basketball leagues, and is widely considered to be the premier men's professional basketball league in the world.

During the course of an NBA game, various numerical measures of a player's performance are taken, such as the points scored or the missed shots caught (known as rebounds). These statistics offer the best objective measure of a player's (or team's) performance.

2.2. Objective of the Study

The baseline requirement of this mini-project is to cluster NBA players based on their statistics.

As an additional objective, we attempted to identify the 'best' players in the NBA for the years 2019-2015 (reverse chronological order, prioritizing more recent years) through clustering.

Thus, the problem statement/question is as follows:

Using clustering based on player statistics, who are the best players in the NBA, and which other players are most similar to them?

Of course, it is entirely possible to simply aggregate all statistics into one catch-all statistic and regard it as a measure of a player's efficiency or overall contribution, then take the players with the top efficiencies as the "best players".

However, rather than simply taking the top players, our objective with clustering is to identify other players with star potential. We intend to do so by taking the "top clusters", or the clusters which contain the best players, and identifying the other, less known players who fall into the same cluster. Not all of these players will have the top stats, but the mere fact that they are clustered with the best players indicates that they are similar to them, and may imply that they have some star potential.

2.3. Methodology Summary

1. Data Collection

The data was collected from the Basketball Reference Website (https://www.basketball-reference.com/). The data in question is the player statistics per game.

2. Cleaning and Preprocessing

The data was cleaned into a manageable format and preprocessed. Values were normalized.

3. Dimensionality Reduction

Each basketball fan, player, and coach may have a different interpretation of which stats are most important to them. Fans of more physical and defensive basketball may find blocks, BLK, to be important, and 3-point Field Goal Percentage, 3P%, to be less so. Thus, due to the subjective interpretation of what is important, three rounds of dimensionality reduction were performed.

4. Clustering

K-Means clustering was performed and the players were clustered. Clustering was done for each of the five seasons. Internal validation criteria were used to identify the best number of clusters to form.

3. Data Collection

3.0. Preliminaries

3.1. Scraping

First, the data was collected from the Basketball Reference website (https://www.basketball-reference.com/). The data in question is the player statistics (stats), taken from the per-game player stats for each season. Per-game stats were chosen because season totals may produce unexpected results.

The URLs used are in the format:

https://www.basketball-reference.com/leagues/NBA_xxxx_per_game.html

Where xxxx signifies the year (NBA season).
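The URL pattern above can be sketched as a small helper. The function name per_game_url is our own illustration and is not taken from the scraping notebook.

```python
def per_game_url(year: int) -> str:
    """Build the per-game stats URL for the season identified by `year`."""
    return f"https://www.basketball-reference.com/leagues/NBA_{year}_per_game.html"


# The five seasons covered by the study, most recent first.
urls = [per_game_url(year) for year in range(2019, 2014, -1)]
```

This only builds the addresses; fetching and parsing the pages is left to the scraping notebook.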

See notebooks/archive/Web Scraping Notebook.ipynb for code used to acquire the data.

To illustrate the preprocessing without having to scrape the website again, the raw DataFrame (as obtained from the website) was saved to a CSV file and read in the cell below.

3.2. Cleaning and Preprocessing

The scrape returns about 734 rows with 30 columns each. However, not all of these rows are player data: the page does not split its tables apart, but separates them with a delimiting row.

Let us look at a delimiting row erroneously read as a data row.

Thus, these have to be split into multiple tables by these rows. Let's find which of these rows are delimiters:

Now deleting them:
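A minimal sketch of finding and deleting the delimiter rows, assuming that a delimiter row simply repeats the column names as its values. The toy DataFrame and its player names are placeholders, not scraped data.

```python
import pandas as pd

# Toy stand-in for the scraped table; the middle row is a repeated header.
raw = pd.DataFrame(
    {
        "Rk": ["1", "Rk", "2"],
        "Player": ["Player A", "Player", "Player B"],
        "PTS": ["5.0", "PTS", "2.0"],
    }
)

# Assumption: a delimiter row carries the column name in the Player column.
is_delimiter = raw["Player"] == "Player"

# Keep only real player rows and renumber the index.
players = raw.loc[~is_delimiter].reset_index(drop=True)
```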

The next concern is that some of the players were traded (sent to another team as part of a transaction) during the season. These are represented in the table as duplicate rows, such as for Avery Bradley:

Mr. Bradley started the season with the Los Angeles Clippers, LAC, and ended it with the Memphis Grizzlies, MEM. His total stats for the season are represented as TOT. The objective is now to replace TOT with MEM, and drop the rows of LAC and MEM.

Now, the duplicated players are represented using their total stats, and their teams are the final ones.
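The consolidation can be sketched as follows. This is an illustration under two assumptions: rows keep the order of the source page (TOT first, then teams in chronological order), and player names are unique. The PTS values are made up for the example.

```python
import pandas as pd

# One traded player (three rows) and one player who was not traded.
stats = pd.DataFrame(
    {
        "Player": ["Avery Bradley", "Avery Bradley", "Avery Bradley", "Player B"],
        "Tm": ["TOT", "LAC", "MEM", "BOS"],
        "PTS": [10.0, 8.0, 16.0, 12.0],  # illustrative values only
    }
)

# Assumption: the last row listed for a player is the team he finished with.
final_team = stats.groupby("Player", sort=False)["Tm"].last()

# Keep the first row per player: the TOT row if traded, otherwise the only row.
consolidated = stats.drop_duplicates(subset="Player", keep="first").copy()

# Replace TOT with the final team.
consolidated["Tm"] = consolidated["Player"].map(final_team)
consolidated = consolidated.reset_index(drop=True)
```

If two different players share a name, a unique player identifier would be needed in place of the Player column.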

Taking a look at the data types:

Since many are in object format, it is desired to convert them to numeric format whenever possible. Using to_numeric:

As indicated by the variable types above, the numeric objects have successfully been converted to integers or floats.
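A minimal sketch of the conversion with pandas.to_numeric, applied column by column so that text columns such as player names are left alone. The toy values are placeholders.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Player": ["Player A", "Player B"],
        "G": ["31", "10"],
        "FG%": ["0.357", "0.222"],
    }
)

for col in df.columns:
    try:
        # Raises if any value in the column cannot be parsed as a number.
        df[col] = pd.to_numeric(df[col])
    except (ValueError, TypeError):
        # Column holds non-numeric text (e.g. names); leave it unchanged.
        pass
```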

3.3. Storage and Retrieval

The function scrape_and_store allows us to scrape, clean/preprocess, and store data in an SQL table, for any year (as in the steps above).

The function read_year allows us to retrieve data for a particular year from the SQL table.

Both function codes may be found in nba_playerclusters/nba_playerclusters/functions.py.
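For readers without access to functions.py, the storage and retrieval idea can be sketched with an in-memory SQLite database. The helpers store_year and load_year, the table name, and the Year column are hypothetical stand-ins, not the actual implementations of scrape_and_store and read_year.

```python
import sqlite3

import pandas as pd


def store_year(df, year, conn, table="player_stats"):
    """Append one season's cleaned rows to `table`, tagged with the season year."""
    df.assign(Year=year).to_sql(table, conn, if_exists="append", index=False)


def load_year(year, conn, table="player_stats"):
    """Read back the rows stored for one season."""
    query = f"SELECT * FROM {table} WHERE Year = ?"
    return pd.read_sql_query(query, conn, params=(year,))


conn = sqlite3.connect(":memory:")
store_year(pd.DataFrame({"Player": ["Player A"], "PTS": [10.0]}), 2019, conn)
store_year(pd.DataFrame({"Player": ["Player B"], "PTS": [12.0]}), 2018, conn)
season_2019 = load_year(2019, conn)
```

Storing all seasons in one table with a year column keeps retrieval to a single filtered query.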

4. EDA

4.1. Distributions of Variables

Prior to clustering it is important to gain insights on each of the variables. Simply performing dimensionality reduction on the data without understanding each of the variables is insufficient. In the following cell, the histogram of the distribution of each of the numeric variables will be plotted. It will be followed by an explanation of each variable, as well as a short statement on the distribution.

In anticipation of clustering, the distributions will need to be taken into account when choosing how to normalize each of the numeric variables; normally distributed variables will be normalized differently from skewed variables.

Variables

Categorical and Object (Not Included in Histograms Above)

Numeric (Included in Histograms Above)

It is important to analyze the distributions of the variables for the purpose of normalization (as stated above), but it was found that there may be another application to doing this: dimensionality reduction. At first glance, it might seem intuitive to retain only one variable from each group (for example, for Free Throws, drop FTA and FT% and keep FT only) but from the distributions, this may not be the case. For example, with 3-pointers, 3P, 3PA, and 3P% all have different distributions, and the same goes for 2P and FT.

Generally, it is observed that some of the variables are approximately normally distributed (such as 2P% and FG%), but most variables are positively skewed. Thus, with these skewed stats, most NBA Players average at lower values, but there are a few players who stand out. These stand-out players will be discussed in the final report.

4.2. Dimensionality Reduction Round 1. Manual Feature Selection

Removing Hybrid Positions

We see that there are very few players that are 'hybrid', that is, listed at two (rather than just one) of the five positions. In order to simplify analysis, we will replace each hybrid player's position with the first listed position. For example, Harrison Barnes, listed as PF-SF, will be labeled simply as a PF.
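A short sketch of this replacement, assuming hybrid positions are written with a hyphen as in the PF-SF example:

```python
import pandas as pd

positions = pd.Series(["PF-SF", "C", "SG-PG", "PG"], name="Pos")

# Keep only the first listed position of each hyphenated entry.
primary = positions.str.split("-").str[0]
```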

We can now perform our EDA by position.

4.2.1. Number of Players by Position

ANALYSIS

In the bar graph above, we see that Guard positions are the most numerous (particularly Shooting Guards), and the Small Forward is the least-populated position. Perhaps this is because SF are all-around, not particularly specialized towards playmaking (PG, SG) or physicality (C, PF).

Now, we will proceed to analyze the different numeric variables in the context of position, by using box plots.

4.2.2. Games, Games Started, and Minutes Played (G, GS, and MP)

ANALYSIS

For Games and Games Started, naturally, the lower and upper bounds are the same. However, we see that SF and C position players have a higher probability of starting a game, likely because there are fewer of them. Since SG are the most numerous, they have the lowest likelihood of starting a game.

For Minutes Played, it immediately jumps out that C and PF players play fewer minutes than the other positions. This is likely because of the physical nature of how they play, which tires them out faster.

4.2.3. 3P, 3PA, and 3P%

ANALYSIS

Immediately we see some outliers here that are perfect from beyond the 3-point line. However, this does not necessarily mean that these are the best or most efficient players in terms of 3-pointers. Instead, it is more likely that the players with high 3P% stats simply took too few shots to tease out their true shooting percentage.

For instance, we have Eric Moreland, who played the most games among the four above, but has been waived (released by his team) in almost every season he has spent in the NBA. Thus, we have two choices:

In keeping with the objective of the study, which is to identify the cluster of players who are the "best of the best", it is more feasible to pick the second option and drop players who played very few games.
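Dropping low-volume players can be sketched as a simple filter on games played. The cutoff MIN_GAMES below is a hypothetical value chosen for illustration; the threshold actually used is not stated here.

```python
import pandas as pd

MIN_GAMES = 10  # hypothetical cutoff, not the study's actual threshold

df = pd.DataFrame(
    {
        "Player": ["Player A", "Player B", "Player C"],
        "G": [1, 5, 70],
        "3P%": [1.000, 1.000, 0.380],
    }
)

# Players with too few games produce unreliable percentages, so drop them.
regulars = df[df["G"] >= MIN_GAMES].reset_index(drop=True)
```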

ANALYSIS

Shooting Guards generally have the best 3-point percentage, and the most 3-pointers made. This is expected, as the Shooting Guards are the players on whom teams depend to provide reliable shooting output. However, outliers in the Point Guard position (Stephen Curry being a notable example) provide the absolute maximum output in terms of 3-pointers.

Based on the distributions above, 3-pointers are a good stat to classify players as the distribution is spread wide, offering good contrast/differentiation between players.

With the exception of Willie Cauley-Stein, who obtained a high 3P% by attempting only two 3-pointers and making one over the entire season, all of the Top 10 3-Point Scorers above now seem like they deserve to be there.

4.2.4. 2P, 2PA, and 2P%

ANALYSIS

Center players generally have the best 2-point percentage, and the most 2-pointers made. This is because Center players typically have the power and strength to force plays and shots 'in low' (close to the basket). These up-close-and-personal plays have higher percentages than shots further out, which shooting guards tend to take.

4.2.5. eFG%

ANALYSIS

Despite adjusting for the importance of 3-point shots, Center players are the most efficient scorers as far as eFG% is concerned.

Using this statistic for clustering will increase the emphasis on field goal percentage. It is important to note that 2P% and 3P%, which are already accounted for, will be considered once again when using eFG%. Since eFG% is probably a better reflection of a player's scoring efficiency, we will drop FG% and keep eFG%.

Also, we will drop FG and FGA since these are already reflected in 2P, 3P, 2PA, and 3PA.
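For reference, eFG% is conventionally defined as (FG + 0.5 * 3P) / FGA, so a made 3-pointer counts as one and a half made field goals. The helper below is our own sketch of that formula and is not code from the notebook.

```python
def efg_pct(fg: float, three_p: float, fga: float) -> float:
    """Effective field goal percentage: (FG + 0.5 * 3P) / FGA."""
    if fga == 0:
        return float("nan")  # undefined for a player with no attempts
    return (fg + 0.5 * three_p) / fga


# Two players who both make 5 of 10 shots (FG% = 0.5 for each).
two_only = efg_pct(fg=5, three_p=0, fga=10)     # every make is a 2-pointer
threes_only = efg_pct(fg=5, three_p=5, fga=10)  # every make is a 3-pointer
```

The second player has the same FG% but a higher eFG%, which is why eFG% already carries information from both 2P% and 3P%.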

4.2.6. FT, FTA, and FT%

ANALYSIS

Guards (Shooting Guards and Point Guards) tend to have the best Free Throw percentages.

Free throw statistics might not be apt to include in the feature array for clustering later on, since we see that the statistics are fairly similar in terms of distribution across positions, thus possibly reducing contrast.

We will drop Free Throw statistics in order to reduce the noise in the dataset, and focus on more important features.

4.2.7. TRB, DRB, and ORB

ANALYSIS

Center players have the most Total Rebounds, and also have the most rebounds of each kind (ORB and DRB). Since Center players stay the closest to the basket (off which missed shots tend to bounce), are the tallest in the league, and also tend to be the strongest, they are naturally inclined to be the players who are able to catch the most rebounds.

We also notice that the distributions for all three variables above look similar. Thus, we can drop at least one of the three. We will use correlations to determine which one/s to drop.

4.2.8. AST

ANALYSIS

Naturally, Point Guards are the AST leaders, since they are responsible for handling the ball and calling plays (strategies) for the team. The Point Guards usually make the passes to the players.

Since no other statistic is similar to AST, it will definitely be used for clustering.

4.2.9. STL

ANALYSIS

Point Guards are the predominant stealers in the NBA.

Once again, no other statistic is similar to STL. Thus, it will be used for clustering.

4.2.10. BLK

ANALYSIS

Center players are the predominant shot blockers in the NBA, owing to their large stature.

Like with steals, no other statistic is similar to BLK. Thus, it will be used for clustering.

4.2.11. TOV

ANALYSIS

Because Point Guards usually handle the ball, they also incur the most turnovers.

Like with STL and BLK, no other statistic is similar to TOV.

However, with TOV, a lower number is preferable, because losing the ball to the opposing team is undesirable. Thus, we will reverse turnovers after scaling and call this statistic ITO (Inverse Turnovers).

4.2.12. PF

ANALYSIS

Center and Power Forward players incur the most fouls per game, likely due to the physical nature of the way they play. This increases the possibility of committing fouls against the opposing players.

Like with TOV, a lower number is preferable, and we will reverse PF after scaling and call this statistic IPF (Inverse Personal Fouls).

4.2.13. PTS

ANALYSIS

Despite the connotation of the word 'Shoot' in 'Shooting Guard', we see that the PTS contribution is fairly similar across positions. However, we cannot drop this stat as domain knowledge tells us that PTS are extremely important, owing to the fact that games are won on the basis of total team PTS.

Thus, we will weight this statistic and arbitrarily consider it to be twice as important as the other stats, after scaling.

5. Other Dimensionality Reduction

5.1. Dimensionality Reduction Round 2: Correlations

A problem when working with data is multicollinearity - a state of very high intercorrelations or inter-associations among the variables. In order to identify and possibly remove variables that are highly correlated with each other, we will visualize the relationships between the variables.

Pair plots could be used to identify whether relationships exist between the variables. However, they would be computationally expensive to produce and very difficult to read due to the high dimensionality. Instead, we will use a heatmap to visualize linear correlations between the variables. This kind of visualization shows the correlations more concisely, at the cost of not displaying the scatter plots.

ANALYSIS

In addition to the variables that were to be dropped as discussed in the previous section of the EDA, for each pair of variables with a correlation above 90%, we will drop at most one. However, deciding which of the two we will drop requires some discussion.
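Listing the highly correlated pairs can be sketched as below. The data is synthetic (DRB is constructed to track TRB closely), so only the mechanics carry over, not the numbers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
trb = rng.gamma(shape=2.0, scale=2.0, size=200)
df = pd.DataFrame(
    {
        "TRB": trb,
        "DRB": 0.75 * trb + rng.normal(0, 0.1, size=200),  # built to correlate
        "AST": rng.gamma(shape=2.0, scale=1.5, size=200),  # independent
    }
)

corr = df.corr().abs()

# Keep only the upper triangle so each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Collect every pair whose absolute correlation exceeds 0.9.
high_pairs = [
    (row, col, upper.loc[row, col])
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > 0.9
]
```

Each entry of high_pairs is a candidate pair from which at most one variable would be dropped.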

In addition, all of the statistics have positive connotations - that is to say, it is desirable for a player to increase the values of these variables, except for TOV (Turnovers) and PF (Personal Fouls), which players wish to reduce. Thus, we will construct two new variables, Inverse Turnovers (ITO), and Inverse Personal Fouls (IPF), which are the inverses of TOV and PF, respectively.
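One way to construct ITO and IPF is sketched below. We assume here that "inverse" means reflecting a value already scaled to the range 0 to 1 (that is, 1 minus the scaled value); the exact transformation used in the notebook may differ.

```python
import pandas as pd

# Toy values assumed to be already scaled to [0, 1].
scaled = pd.DataFrame({"TOV": [0.0, 0.25, 1.0], "PF": [0.5, 1.0, 0.0]})

# Assumption: inverse = 1 - scaled value, so fewer turnovers/fouls score higher.
scaled["ITO"] = 1.0 - scaled["TOV"]
scaled["IPF"] = 1.0 - scaled["PF"]

# The original columns are no longer needed once the inverses exist.
features = scaled.drop(columns=["TOV", "PF"])
```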

5.3. Dimensionality Reduction Round 3: Principal Components

At this point, we will use PCA to identify the principal components of the dataset. However, we may or may not push through with actually using PCA on the final feature array.

ANALYSIS

Based on the graph of cumulative variance explained, we need 5 components to explain 80% of the variance.
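Reading the number of components off the cumulative variance curve can be sketched as follows. The feature matrix is random synthetic data, so the resulting count will not match the 5 components reported for the player data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))  # synthetic stand-in for the feature array

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 80%.
n_components_80 = int(np.searchsorted(cumulative, 0.80) + 1)
```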

ANALYSIS

2PA, FTA, 2P, and FG are quite close to the principal component PC1, meaning that they explain a larger amount of variance in the data. G and GS are close as well. This was to be expected on the part of FG, but is quite surprising on the part of G and GS. It is possible that G and GS are also an indication of how valuable a player is, since coaches are more willing to let valuable players play more games and less valuable players fewer.

Also, the Field Goal Percentage Measures such as FG% and 3P% are closest to perpendicularity to PC1, implying that they may not be as effective at explaining variance. This is likely because players can have high percentages without providing much output (such as our example earlier with Eric Moreland).

VERDICT

Revise manual feature selection as follows:

However, we will proceed to drop the attempts made such as 2PA because in determining valuable players, we are more concerned with how many shots were actually made rather than the attempts.

Dropping the above features:

6. Clustering

6.1. Scaling

When clustering without scaling or normalization, features with large ranges implicitly receive greater weight in the distance metric than features with smaller ranges (Aksoy and Haralick, 2001).

Thus, some form of scaling is required to reduce the features into dimensionless data.

Since the histograms produced in the EDA showed some features that were not normally distributed, we scaled the data using the sklearn.preprocessing.MinMaxScaler, which transforms features by scaling each feature to a given range. It scales and translates each feature individually such that it is between zero and one.
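A minimal sketch of the scaling step on toy values, including the doubling of PTS described in the EDA. Applying the weight as a plain multiplication after scaling is our reading of that description, not a confirmed detail of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

features = pd.DataFrame(
    {
        "PTS": [2.0, 10.0, 30.0],
        "AST": [0.5, 4.0, 11.0],
        "BLK": [0.1, 2.0, 0.4],
    }
)

# Scale each column independently to the range [0, 1].
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)

# Assumption: PTS is weighted by multiplying the scaled column by 2.
scaled["PTS"] = 2.0 * scaled["PTS"]
```

After this step PTS spans 0 to 2 while every other feature spans 0 to 1, so differences in PTS contribute more to Euclidean distances.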

First five rows of the scaled feature array:

6.2. Visualization Using TSNE

Next, we use TSNE, which is an algorithm that allows us to visualize high-dimensionality datasets in two dimensions. We assign a random_state in order to provide consistency of results across executions of the notebook.

We reduce the feature array to two dimensions using TSNE, then visualize using a scatter plot:
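The reduction can be sketched with sklearn.manifold.TSNE on a random stand-in for the scaled feature array. The perplexity value is an arbitrary choice for this small example.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.random((60, 8))  # synthetic stand-in: 60 players, 8 scaled features

# random_state fixes the embedding across runs; perplexity must be < n_samples.
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
embedding = tsne.fit_transform(X)
```

The two columns of embedding can then be passed to a scatter plot, colored by position or by cluster label.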

ANALYSIS

Although we will not be clustering based on position, we can see that there is some distinction between positions even when the data is plotted using TSNE. This implies that different positions have different playing styles, although this is an affirmation rather than a discovery.

6.3. Clustering 2018-2019 Season Players with KMeans

We will cluster over a range of values of k in order to generate a plot of internal validation criteria, which will allow us to select an optimum k. We set a random_state to keep results consistent across executions of the notebook, and limit the range to between 2 and 10 clusters to keep the number of clusters parsimonious.
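The loop over k can be sketched as follows. The data is synthetic, and inertia and the silhouette score are used here as stand-in criteria; they are not necessarily the same internal validation criteria plotted in the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled feature array.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

results = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results[k] = {
        "inertia": model.inertia_,  # within-cluster sum of squares
        "silhouette": silhouette_score(X, model.labels_),
    }

# One possible selection rule: the k with the highest silhouette score.
best_k = max(results, key=lambda k: results[k]["silhouette"])
```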

ANALYSIS

We can see that the clustering algorithm is able to do a good job of clustering the players. Despite the feature array having a dimensionality greater than 2, the separation of clusters is captured well in the TSNE plots.

However, visual inspection is not always a satisfactory method for determining the optimum number of clusters. In order to provide a more objective measure of the optimum number of clusters, we will plot the internal validation criteria.

ANALYSIS

VERDICT

Try picking 7 clusters as suggested by Intra-Inter.

Unique clusters:

Number of players in each cluster:

It was stated in our objectives for this work that we aim to identify the clusters containing the best players. However, how do we know which cluster (or clusters) that is? For that, we will use the Player Efficiency Rating (PER).

6.3.1. Player Efficiency Rating (PER)

The PER is an all-in-one basketball rating, developed by John Hollinger, which attempts to summarize a player's overall contribution or efficiency in a single number. The formula is as follows:

The formula is quite long and tedious to encode and calculate, so we will use a linear approximation to the PER, formulated by Bleacher Report's Zach Fein, which uses the following coefficients:

Now, we will obtain the distributions of the players' PER per cluster and identify the cluster/s that are better than the rest:
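Comparing PER across clusters can be sketched with a groupby. The cluster labels and PER values below are toy numbers, and the column names cluster and PER are our own.

```python
import pandas as pd

players = pd.DataFrame(
    {
        "cluster": [0, 0, 1, 1, 2, 2],
        "PER": [11.0, 12.0, 22.0, 24.0, 15.0, 16.0],  # illustrative values
    }
)

# Summary of PER within each cluster.
per_by_cluster = players.groupby("cluster")["PER"].agg(["mean", "median", "count"])

# The two clusters with the highest mean PER.
top_two = per_by_cluster["mean"].nlargest(2).index.tolist()
```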

ANALYSIS

If Player Efficiency Rating is used as the basis, we can see some differentiation between clusters. It can be seen that clusters 1 and 6 have higher PER than the other clusters, notably cluster 4 which has a dismal mean PER of about 11.

Of course, we could have simply clustered by PER. Although one may argue that this statistic already provides a weighted aggregation of most of the other statistics, doing so is somewhat one-dimensional, as players who are in close proximity by virtue of the PER are not necessarily as similar in terms of all statistics.

Now that we have identified that our best clusters are clusters 1 and 6, we will see what positions comprise each of the clusters, then construct a word cloud of the player names in each cluster to get a brief glimpse of who is in which cluster. A write-up will be provided after the two visualizations.
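The position breakdown per cluster can be sketched with a cross-tabulation. The rows below are toy data shaped to resemble the two star clusters; the word cloud step is omitted.

```python
import pandas as pd

players = pd.DataFrame(
    {
        "cluster": [1, 1, 1, 6, 6, 6],
        "Pos": ["C", "C", "PF", "PG", "PG", "SG"],
    }
)

# Rows are clusters, columns are positions, cells are player counts.
pos_counts = pd.crosstab(players["cluster"], players["Pos"])

# Most frequent position in each cluster.
dominant = pos_counts.idxmax(axis=1)
```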

ANALYSIS

Based on our findings that Clusters 1 and 6 have the best overall statistics, and that each of the two clusters is mostly dominated by a single position (Point Guard for Cluster 6 and Center for Cluster 1), we might conclude that the 2018-2019 season is the season of the Point Guards and Centers. This is evidenced by the popularity of Centers such as Anthony Davis (Cluster 1), and Point Guards such as Stephen Curry (Cluster 6). Contradicting this claim, though, is the awarding of league MVP to Giannis Antetokounmpo (Cluster 1), who plays at the Power Forward position.

6.4. Clustering 2017-2018 Season Players with KMeans

In the succeeding subsections under Section 6, we will perform clustering on the players for the previous four seasons (2018, 2017, 2016, and 2015) in reverse chronological order. However, the write-ups for the previous seasons will not be as comprehensive as those for the 2019 season.

In the following cell, we perform all operations done to the 2019 player data in Section 6, to the 2018 player data, then perform clustering to see if the trends remain the same.

ANALYSIS

VERDICT

Try picking 5 clusters as suggested by Intra-Inter, and in the interest of parsimony.

Unique clusters:

ANALYSIS

Clusters 0 and 3 are the 'star' clusters of this season. From the position bar plots we see that cluster 0 is still the cluster of the Center players, but we see that for cluster 3, although Point Guards are still the most numerous in the best cluster, Shooting Guards come close behind. So we might say that 2017-2018 is the season of the Point Guards, Shooting Guards, and Centers.

6.5. Clustering 2016-2017 Season Players with KMeans

ANALYSIS

VERDICT

Try picking 7 clusters as suggested by Intra-Inter.

ANALYSIS

Clusters 2 and 3 are the 'star' clusters of this season. We see a return of the Point Guards and Centers, but curiously, the second-most frequent position in cluster 2 is the Small Forward.

So we might say that 2016-2017 is the season of the Centers, Point Guards, and Small Forwards.

6.6. Clustering 2015-2016 Season Players with KMeans

ANALYSIS

VERDICT

Try picking 5 clusters as suggested by Intra-Inter, and in the interest of parsimony.

ANALYSIS

Clusters 3 and 4 are the 'star' clusters of this season. Cluster 3 remains the cluster of the Centers, but Power Forwards come at a close second in this cluster. Also, Shooting Guards make their return as second to the Point Guards in Cluster 4.

6.7. Clustering 2014-2015 Season Players with KMeans

ANALYSIS

VERDICT

Try picking 5 clusters as suggested by Intra-Inter, and in the interest of parsimony.

ANALYSIS

Clusters 3 and 4 are the 'star' clusters of this season. No different trends are seen this season.

7. Conclusions

7.1. General Findings

For each of the past five seasons, we have identified two standout clusters which contain players more highly-rated than those of the other clusters. By plotting the player positions in each of these clusters, we have discovered that one of the two clusters is composed mainly of Centers (sometimes with Power Forwards), and the other cluster is composed mainly of Point Guards (as well as Shooting Guards). This means that although players of both positions may stand out, they cannot be clustered together due to the fundamental differences in their playstyles.

There are two different possible interpretations for this:

  1. We are in the era of the star Point Guards and Centers. In recent times, Point Guards and Centers have climbed in popularity, with the likes of star Center Tim Duncan and 3-point specialist Stephen Curry. However, this theory does not account for the fact that star players also exist at other positions, particularly Small Forward, such as all-around player LeBron James, Kevin Durant, and recent Finals MVP Kawhi Leonard. On the other hand, it is entirely possible that this theory holds, since these three star SF players tend to be clustered in the star Point Guard cluster each year. This means that while exceptional players exist at the other positions, they are outnumbered by the Point Guards and Centers.

  2. Since PG and C represent two opposing ends of the play spectrum, they may represent 'model' players. This would explain why Shooting Guards are the second-most frequent position in the star PG cluster, and PF is the second-most frequent position in the star C cluster. Since SF is mid-way between the two, it appears the least frequently.

7.2. The Best of the Best

In this section, we will show the word clouds of the two best clusters for each of the five seasons. Our recommendation to coaches of NBA teams would be to watch out for the players that appear in these clusters. It is important to note that some of these players may have lower PER than some other players in the other clusters. However, the mere fact that they are clustered together with the best players indicates that they are similar to them in some way, and may imply that they have some star potential in them.

7.2.1. 2018-2019 Season Players to Watch Out For

7.2.2. 2017-2018 Season Players to Watch Out For

7.2.3. 2016-2017 Season Players to Watch Out For

7.2.4. 2015-2016 Season Players to Watch Out For

7.2.5. 2014-2015 Season Players to Watch Out For

8. References